Copy, please, these files and directories to your personal directory:
cp -r /data/shared/AGE2020/Exercises/E07-RNA_seq/03_exploratory_analysis ~/AGE2020/Exercises/E07-RNA_seq
And switch the R working directory to the current exercise: setwd("~/AGE2020/Exercises/E07-RNA_seq/03_exploratory_analysis")
In the previous tutorial we learned how to go from raw RNA-seq reads to count matrix imported to R. We were using a subset of reads from the airway experiment. However, a full version of this experiment is available in the airway package (as the SummarizedExperiment object). We will further work with this object.
Some vector/matrix operations can be speed-up by parallelization. Package BiocParallel offers parallelized versions of functions from the apply family. For example, parallelized version of lapply() is bplapply(). Many functions in DESeq2 can be run parallelized.
We create a BPPARAM variable containing information about parallelization parameters (2 threads). We will reuse this object later. If we call register() with our BPPARAM, some packages (e.g. DESeq2), will then run parallelized automatically, using our BPPARAM.
library(BiocParallel)
BPPARAM <- MulticoreParam(workers = 2)
register(BPPARAM)
Now let’s load the SummarizedExperiment object of the airway data:
source("../../age_library.R")
library(magrittr)
library(airway)
data("airway")
se <- airway
se
## class: RangedSummarizedExperiment
## dim: 64102 8
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(se)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <factor> <factor> <factor> <factor> <factor> <integer> <factor> <factor> <factor>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568 SAMN02422669
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567 SAMN02422675
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571 SAMN02422678
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572 SAMN02422670
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575 SAMN02422682
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576 SAMN02422673
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579 SAMN02422683
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358 SRS508580 SAMN02422677
So far we have been working with four samples simply divided to untrt and trt groups (trt was dexamethasone). Now there is an additional division of samples by cell line (cell column).
You can see that the se object is an instance of class RangedSummarizedExperiment. In this object there is an additional list (GRangesList) holding genomic coordinates (GRanges) of rows (features - genes) of the count matrix (assays). GRangesList can be accessed by rowRanges() function:
rowRanges(se)
## GRangesList object of length 64102:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] X 99883667-99884983 - | 667145 ENSE00001459322
## [2] X 99885756-99885863 - | 667146 ENSE00000868868
## [3] X 99887482-99887565 - | 667147 ENSE00000401072
## [4] X 99887538-99887565 - | 667148 ENSE00001849132
## [5] X 99888402-99888536 - | 667149 ENSE00003554016
## ... ... ... ... . ... ...
## [13] X 99890555-99890743 - | 667156 ENSE00003512331
## [14] X 99891188-99891686 - | 667158 ENSE00001886883
## [15] X 99891605-99891803 - | 667159 ENSE00001855382
## [16] X 99891790-99892101 - | 667160 ENSE00001863395
## [17] X 99894942-99894988 - | 667161 ENSE00001828996
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
##
## ...
## <64101 more elements>
You can see that GRangesList has a length of 64102 (= number of rows in se) and its keys are ENSEMBL IDs.
We don’t go deeper now, for more information see ?RangedSummarizedExperiment. I would just note that you can use subset() function to filter features based on GRanges columns (e.g. start, end), and that could be handy.
Let’s go back to our experimental data. We want to be sure that untrt is the reference level for the dex variable:
se$dex <- relevel(se$dex, "untrt")
se$dex
## [1] untrt trt untrt trt untrt trt untrt trt
## Levels: untrt trt
From se object we can immediately create a DESeqDataSet object:
library(DESeq2)
dds <- DESeqDataSet(se, design = ~ cell + dex)
Let’s annotate genes by using the code from previous tutorial. Note that a custom function for this repeating code would be neat 🙂
library(org.Hs.eg.db)
ann <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(dds), keytype = "ENSEMBL", columns = c("SYMBOL", "GENENAME"))
ann <- ann[!duplicated(ann$ENSEMBL), ]
rownames(ann) <- ann$ENSEMBL
ann <- ann[rownames(dds), ]
stopifnot(all(rownames(ann) == rownames(dds)))
rowData(dds) <- cbind(rowData(dds), ann)
In RNA-seq, count matrices are usually quite sparse, i.e. some genes have zero counts in almost all samples. We can quickly remove these genes (rows) and consequently speed-up an analysis and reduce the size of the object. We remove all rows having zero or one count across the samples:
nrow(dds)
## [1] 64102
dds <- dds[rowSums(counts(dds)) > 1, ]
nrow(dds)
## [1] 29391
BAM! More than half of the rows disappeared.
We can now prepare the dds object for the next steps. DESeq() function will do several things, shortly:
dds <- DESeq(dds, parallel = TRUE)
Some parts of the following text and images are taken from here.
The counts of mapped reads for each gene is proportional to the expression of RNA (“interesting”) in addition to many other factors (“uninteresting”). Normalization is the process of scaling raw count values to account for the “uninteresting” factors. In this way the expression levels are more comparable between and/or within samples.
While normalization is essential for differential expression analyses, it is also necessary for exploratory data analysis, visualization of data, and whenever you are exploring or comparing counts between or within samples. However, do not use normalized counts for differential expression testing. Packages for DE testing (such as DESeq2) expects raw counts and apply custom algorithms for normalization.
The main factors often considered during normalization are:
Accounting for sequencing depth is necessary for comparison of gene expression between samples. In the example below, each gene appears to have doubled in expression in Sample A relative to Sample B, however this is a consequence of Sample A having double the sequencing depth. NOTE: Each pink and green rectangle represents a read aligned to a gene. Reads connected by dashed lines represents reads spanning an intron.
Accounting for gene length is necessary for comparing expression between different genes within the same sample. In the example, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer.
A few highly differentially expressed genes between samples, differences in the number of genes expressed between samples, or presence of contamination can skew some types of normalization methods. Accounting for RNA composition is recommended for accurate comparison of expression between samples, and is particularly important when performing differential expression analyses (source). In the example, if we were to divide each sample by the total number of counts to normalize, the counts would be greatly skewed by the DE gene, which takes up most of the counts for Sample A, but not Sample B. Most other genes for Sample A would be divided by the larger number of total counts and appear to be less expressed than those same genes in Sample B.
Several common normalization methods exist to account for these differences:
| Normalization method | Description | Accounted factors | Recommendations for use |
|---|---|---|---|
| CPM/FPM (counts/fragments per million) | read counts/fragments scaled by total number of reads/fragments | sequencing depth | gene count comparisons between replicates of the same samplegroup; NOT for within sample comparisons or DE analysis |
| TPM (transcripts per kilobase million) | counts per length of transcript (kb) per million reads mapped | sequencing depth and gene length | gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis |
| RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped) | similar to TPM, but the total number of RPKM/FPKM normalized counts for each sample will be different | sequencing depth and gene length | gene count comparisons between genes within a sample; NOT for between sample comparisons or DE analysis |
| DESeq2’s median of ratios | counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene | sequencing depth and RNA composition | gene count comparisons between samples and for DE analysis; NOT for within sample comparisons |
For between sample comparison, we have to estimate library size, i.e. sequencing depth.
Counts/fragments per million (CPM/FPM) mapped reads/fragments are counts (\(C_i\)) scaled by the number of reads/fragments you sequenced (\(N\)) times one million. This unit is related to the FPKM without length normalization and a factor of \(10^3\):
\[CPM_i = \frac{C_i}{\frac{N}{10^6}} = \frac{C_i}{N} \cdot 10^6\]
This is the most simple normalization accounting only for sequencing depth. It is usable only for gene count comparisons between replicates of the same sample group where significant changes in gene expression are not expected.
Since tools for differential expression analysis are comparing the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition do need to be taken into account.
To normalize for sequencing depth and RNA composition, DESeq2 uses the median of ratios method. On the user-end there is only one step, but on the back-end there are multiple steps involved, as described below.
NOTE: The steps below describe in detail some of the steps performed by DESeq2 when you run a single function to get DE genes. Basically, for a typical RNA-seq analysis, you would not run these steps individually.
Step 1: create a pseudo-reference sample (row-wise geometric mean)
For each gene, a pseudo-reference sample is created that is equal to the geometric mean across all samples.
| gene | sampleA | sampleB | pseudo-reference sample |
|---|---|---|---|
| EF2A | 1489 | 906 | sqrt(1489 * 906) = 1161.5 |
| ABCD1 | 22 | 13 | sqrt(22 * 13) = 17.7 |
| … | … | … | … |
Step 2: calculates ratio of each sample to the reference
For every gene in a sample, the ratios (sample/pseudo-reference) are calculated (as shown below). This is performed for each sample in the dataset. Since the majority of genes are not differentially expressed, the majority of genes in each sample should have similar ratios within the sample.
| gene | sampleA | sampleB | pseudo-reference sample | ratio of sampleA/ref | ratio of sampleB/ref |
|---|---|---|---|---|---|
| EF2A | 1489 | 906 | 1161.5 | 1489/1161.5 = 1.28 | 906/1161.5 = 0.78 |
| ABCD1 | 22 | 13 | 16.9 | 22/16.9 = 1.30 | 13/16.9 = 0.77 |
| MEFV | 793 | 410 | 570.2 | 793/570.2 = 1.39 | 410/570.2 = 0.72 |
| BAG1 | 76 | 42 | 56.5 | 76/56.5 = 1.35 | 42/56.5 = 0.74 |
| MOV10 | 521 | 1196 | 883.7 | 521/883.7 = 0.590 | 1196/883.7 = 1.35 |
| … | … | … | … |
Step 3: calculate the normalization factor for each sample (size factor)
The median value (column-wise for the above table) of all ratios for a given sample is taken as the normalization factor (size factor) for that sample, as calculated below. Notice that the differentially expressed genes should not affect the median value:
normalization_factor_sampleA <- median(c(1.28, 1.3, 1.39, 1.35, 0.59))
normalization_factor_sampleB <- median(c(0.78, 0.77, 0.72, 0.74, 1.35))
The median of ratios method makes the assumption that not ALL genes are differentially expressed; therefore, the normalization factors should account for sequencing depth and RNA composition of the sample (large outlier genes will not represent the median ratio values). This method is robust to imbalance in up-/down-regulation and large numbers of differentially expressed genes.
NOTE: Usually these size factors are around 1, if you see large variations between samples it is important to take note since it might indicate the presence of extreme outliers.
Step 4: calculate the normalized count values using the normalization factor
This is performed by dividing each raw count value in a given sample by that sample’s normalization factor to generate normalized count values. This is performed for all count values (every gene in every sample). For example, if the median ratio for SampleA was 1.3 and the median ratio for SampleB was 0.77, you could calculate normalized counts as follows:
SampleA median ratio = 1.3
SampleB median ratio = 0.77
Raw Counts
| gene | sampleA | sampleB |
|---|---|---|
| EF2A | 1489 | 906 |
| ABCD1 | 22 | 13 |
| … | … | … |
Normalized counts
| gene | sampleA | sampleB |
|---|---|---|
| EF2A | 1489 / 1.3 = 1145.39 | 906 / 0.77 = 1176.62 |
| ABCD1 | 22 / 1.3 = 16.92 | 13 / 0.77 = 16.88 |
| … | … | … |
NOTE: Please note that normalized count values are not whole numbers.
Normalized counts can be retrieved by:
counts(dds, normalized = TRUE) %>% head()
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 663.31418 499.90698 740.152838 608.90630 966.3137310 748.37220 836.24873 605.60240
## ENSG00000000419 456.21167 574.66985 526.500473 544.73235 498.4412655 571.10734 452.87756 537.84269
## ENSG00000000457 253.99365 235.44726 222.978461 244.75646 208.0376662 236.59140 253.04669 242.45271
## ENSG00000000460 58.61392 61.37251 33.913074 52.23461 66.2323998 45.03099 82.53884 63.52473
## ENSG00000000938 0.00000 0.00000 1.695654 0.00000 0.8491333 0.00000 0.00000 0.00000
## ENSG00000000971 3175.89749 4105.26291 5237.026440 6345.75883 5707.0251194 7881.85315 5621.32909 8464.66992
For between gene comparison, we have to estimate both library size and “gene size”, i.e. some measure of transcript length. The simplest method is to take the median of transcript lengths, which is calculated as the sum (in base pairs) of each transcript’s exons. More sophisticated methods/tools such as Salmon, kallisto or Sailfish are using “effective length”, which takes into account the fact that not every transcript in the population can produce a fragment of every length starting at every position.
RPKM/FPKM normalize for both sequencing depth and gene length:
\[FPKM_i = \frac{C_i}{\frac{N}{10^6} \cdot \frac{L_i}{10^3}} = \frac{C_i}{L_i\cdot N} \cdot 10^9\]
In the next section you will see why RPKM/FPKM is not suitable for between sample comparison. As you may have already noticed, it’s because of the \(N\) varying between samples.
Taken from this great blog post:
TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:
\[TPM_i = \frac{C_i}{L_i} \cdot \frac{10^6}{\sum{\frac{C_i}{L_i}}}\]
So you see, when calculating TPM, the only difference from FPKM is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.
When you use TPM, the sum of all TPMs in each sample are the same (\(\sum{\frac{C_i}{L_i}}\)). This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads (\(N\)) in each sample may be different, and this makes it harder to compare samples directly.
Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in sample B is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the sum of the TPMs in both samples always add up to the same number (so the denominator required to calculate the proportions is the same, regardless of what sample you are looking at.)
With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.
While TPM and RPKM/FPKM normalization methods both account for sequencing depth and gene length, RPKM/FPKM are not recommended. The reason is that the normalized count values output by the RPKM/FPKM method are not comparable between samples.
Using RPKM/FPKM normalization, the total number of RPKM/FPKM normalized counts for each sample will be different. Therefore, you cannot compare the normalized counts for each gene equally between samples.
Deeper explanation of FPKM and TPM problems can be found here.
To calculate TPM, we need to know gene lengths. Probably the easiest way is to calculate a median of transcript lengths of each gene. We can use a package GenomicFeatures, which includes a special class TxDb holding information about genomic ranges. This object can be created from GTF or GFF file.
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("../_data/genome/Homo_sapiens.GRCh37.75.gtf", format = "gtf")
# For each gene we get a GRanges object of its transcripts.
# This object holds genomic range of each transcript.
txs_by_gene <- transcriptsBy(txdb, "gene")
# We only want genes which are both in our DESeqDataSet and txs_by_gene.
common_genes <- intersect(rownames(dds), names(txs_by_gene))
dds <- dds[common_genes, ]
txs_by_gene <- txs_by_gene[common_genes]
head(txs_by_gene)
## GRangesList object of length 6:
## $ENSG00000000003
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] X 99883667-99891803 - | 194231 ENST00000373020
## [2] X 99887538-99891686 - | 194232 ENST00000496771
## [3] X 99888439-99894988 - | 194233 ENST00000494424
## -------
## seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
##
## $ENSG00000000419
## GRanges object with 7 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 20 49551404-49575087 - | 182563 ENST00000371588
## [2] 20 49551404-49575087 - | 182564 ENST00000466152
## [3] 20 49551404-49575092 - | 182565 ENST00000371582
## [4] 20 49551433-49562398 - | 182566 ENST00000494752
## [5] 20 49551482-49575058 - | 182567 ENST00000371584
## [6] 20 49551490-49575069 - | 182568 ENST00000371583
## [7] 20 49552685-49575069 - | 182569 ENST00000413082
## -------
## seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
##
## $ENSG00000000457
## GRanges object with 5 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 1 169818772-169863093 - | 15488 ENST00000367771
## [2] 1 169821804-169863076 - | 15489 ENST00000367772
## [3] 1 169822215-169858029 - | 15490 ENST00000367770
## [4] 1 169823652-169863408 - | 15491 ENST00000423670
## [5] 1 169828260-169863093 - | 15492 ENST00000470238
## -------
## seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
##
## ...
## <3 more elements>
# We are only interested in transcript names.
# For each gene we get a list of its transcript names.
txs_by_gene <- lapply(txs_by_gene, function(x) {
mcols(x, use.names = TRUE)[, "tx_name"]
})
# We can use the parallelized version of lapply(), but it sometimes causes errors.
# txs_by_gene <- bplapply(txs_by_gene, function(x) {
# mcols(x, use.names = TRUE)[, "tx_name"]
# }, BPPARAM = BPPARAM)
head(txs_by_gene)
## $ENSG00000000003
## [1] "ENST00000373020" "ENST00000496771" "ENST00000494424"
##
## $ENSG00000000419
## [1] "ENST00000371588" "ENST00000466152" "ENST00000371582" "ENST00000494752" "ENST00000371584" "ENST00000371583" "ENST00000413082"
##
## $ENSG00000000457
## [1] "ENST00000367771" "ENST00000367772" "ENST00000367770" "ENST00000423670" "ENST00000470238"
##
## $ENSG00000000460
## [1] "ENST00000498289" "ENST00000472795" "ENST00000413811" "ENST00000496973" "ENST00000481744" "ENST00000359326" "ENST00000466580" "ENST00000456684"
## [9] "ENST00000459772" "ENST00000286031"
##
## $ENSG00000000938
## [1] "ENST00000374005" "ENST00000399173" "ENST00000545953" "ENST00000374004" "ENST00000374003" "ENST00000457296" "ENST00000475472" "ENST00000468038"
##
## $ENSG00000000971
## [1] "ENST00000439155" "ENST00000367429" "ENST00000496761" "ENST00000359637" "ENST00000466229" "ENST00000470918"
# For each transcript we get a GRanges object of its exons.
exons_by_tx <- exonsBy(x = txdb, by = "tx", use.names = TRUE)
head(exons_by_tx)
## GRangesList object of length 6:
## $ENST00000456328
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] 1 11869-12227 + | 1 ENSE00002234944 1
## [2] 1 12613-12721 + | 8 ENSE00003582793 2
## [3] 1 13221-14409 + | 12 ENSE00002312635 3
## -------
## seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
##
## $ENST00000515242
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] 1 11872-12227 + | 2 ENSE00002234632 1
## [2] 1 12613-12721 + | 9 ENSE00003608237 2
## [3] 1 13225-14412 + | 13 ENSE00002306041 3
## -------
## seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
##
## $ENST00000518655
## GRanges object with 4 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] 1 11874-12227 + | 3 ENSE00002269724 1
## [2] 1 12595-12721 + | 6 ENSE00002270865 2
## [3] 1 13403-13655 + | 14 ENSE00002216795 3
## [4] 1 13661-14409 + | 16 ENSE00002303382 4
## -------
## seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
##
## ...
## <3 more elements>
# For each transcript we get a sum of lengths of its exons.
tx_lengths <- sum(width(exons_by_tx))
head(tx_lengths)
## ENST00000456328 ENST00000515242 ENST00000518655 ENST00000450305 ENST00000473358 ENST00000469289
## 1657 1653 1483 632 712 535
# For each gene we get a median length of its transcripts.
gene_lengths <- lapply(txs_by_gene, function(x) median(tx_lengths[x])) %>% unlist()
# gene_lengths <- bplapply(txs_by_gene, function(x) median(tx_lengths[x]), BPPARAM = BPPARAM) %>% unlist()
# We sort gene lengths by the order of dds rows.
gene_lengths <- gene_lengths[rownames(dds)]
head(gene_lengths)
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938 ENSG00000000971
## 1025.0 1073.0 2916.0 1384.0 2131.5 1638.0
stopifnot(all(rownames(dds) == names(gene_lengths)))
Now we can use the gene_lengths to calculate the TPM values. Following code is just a matrix version of the TPM formula:
# First we calculate Ci/Li from the TPM formula.
x <- counts(dds) / gene_lengths
tpm <- t(t(x) * 1e6 / colSums(x))
head(tpm)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 29.998004 21.929473 32.29820620 25.184935 43.35055907 31.904105 37.909764 25.222508
## ENSG00000000419 19.708955 24.081390 21.94723949 21.522746 21.36066089 23.257916 19.611940 21.398341
## ENSG00000000457 4.037684 3.630521 3.42023741 3.558450 3.28061201 3.545392 4.032298 3.549475
## ENSG00000000460 1.963187 1.993887 1.09600346 1.600061 2.20056813 1.421766 2.771161 1.959437
## ENSG00000000938 0.000000 0.000000 0.03558219 0.000000 0.01831854 0.000000 0.000000 0.000000
## ENSG00000000971 89.877196 112.691182 143.00516667 164.241872 160.21246560 210.265115 159.464710 220.607704
Because the TPM matrix has the same dimension, rownames and colnames as the count matrix in dds, we can add it to assays:
assay(dds, "tpm") <- tpm
Taken from here, shortened:
Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. For RNA-seq raw counts, however, the variance grows with the mean. For example, if one performs PCA directly on a matrix of size-factor-normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with the very lowest counts will tend to dominate the results because, due to the strong Poisson noise inherent to small count values, and the fact that the logarithm amplifies differences for the smallest values, these low count genes will show the strongest relative differences between samples.
As a solution, DESeq2 offers transformations for count data that stabilize the variance across the mean. One such transformation is the regularized-logarithm transformation or rlog2. For genes with high counts, the rlog transformation will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. The rlog-transformed data can be used directly for computing distances between samples and making PCA plots.
Another transformation that similarly improves distance calculation across samples, the variance stabilizing transformation implemented in the vst() function, is discussed alongside the rlog in the DESeq2 vignette.
In rlog() function we specify blind = FALSE, which means that differences across donors and treatment should not add to the variance-mean profile of the experiment.
rld <- rlog(dds, blind = FALSE)
rld is object of class DESeqTransform. Counts are retrieved by assay() function. Later we will use the rld object for exploratory analysis, but not for differential expression testing. That’s why we have created an another object than dds.
We can now look what effect different transformations have. In the vsn package there is an useful function meanSdPlot() which plots mean and standard deviation of each gene across the samples.
library(ggplot2)
vsn::meanSdPlot(counts(dds, normalized = TRUE) + 1, plot = FALSE, rank = FALSE)$gg + ggtitle("Normalized counts")
vsn::meanSdPlot((counts(dds, normalized = TRUE) + 1) %>% log2(), plot = FALSE, rank = FALSE)$gg + ggtitle("log2(normalized counts + 1)")
vsn::meanSdPlot(assay(rld), plot = FALSE, rank = FALSE)$gg + ggtitle("rlog-normalized counts")
You can see that in normalized, but untransformed counts, genes with the highest mean also have the highest standard deviation. This is the mean-variance trend in RNA-seq data.
Log2 transformation makes it better, but now counts with the smallest values dominate in standard deviation, because log2 amplified their values the most.
Rlog transformation seems to be the best, as it correctly shrunk genes with low counts to average across all samples.
We can also look at a sample-to-sample count plot:
par(mfrow = c(1, 3))
plot(counts(dds, normalized = TRUE)[, 1:2] + 1, pch = 16, cex = 0.3)
plot((counts(dds, normalized = TRUE)[, 1:2] + 1) %>% log2(), pch = 16, cex = 0.3)
plot(assay(rld)[, 1:2], pch = 16, cex = 0.3)
Again, the mean-variance trend is visible in untransformed counts. Log2 transformation makes it better, but many low count genes now show the strongest relative differences between samples. Rlog correctly transforms them, as they mean across all samples is very low and they are just a technical noise.
Exploratory analysis will help us to assess the main characteristics of our data. We can check if our experiment makes sense at biological level and if it was done correctly at technical level. Don’t forget to use the rlog-transformed data (rld object).
Principal Component Analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset (dimensionality reduction).
DESeq2 package has a plotPCA() function for that. You can specify the parameter intgroup to color points by column from the sample sheet.
plotPCA(rld, intgroup = "dex")
plotPCA(rld, intgroup = "cell")
It seems that samples are correctly divided by treatment. There is an evident difference between N080611 (blue) and other cell lines.
Note that plotPCA() is returning a ggplot2 object, which can be saved and further modified:
p <- plotPCA(rld, intgroup = "dex")
p + ggtitle("PCA plot") + theme_bw()
You can also only return a dataframe with principal components and add more aesthetics to the plot:
p <- plotPCA(rld, intgroup = "dex", returnData = TRUE)
head(p)
p <- data.frame(p, cell = rld$cell, SampleName = rld$SampleName)
head(p)
ggplot(p, aes(x = PC1, y = PC2)) +
# We want to show points, color them by cell and shape them by dex.
geom_point(aes(color = cell, shape = dex), size = 5) +
# One unit of x is equal to one unit of y.
coord_fixed() +
ggtitle("PCA plot") +
# Annotate points.
ggrepel::geom_text_repel(aes(label = SampleName)) +
# Black and white theme.
theme_bw()
Hierarchical clustering is clustering samples by their (Euclidean) distance. Heatmap is used to show how samples are clustered (dendrogram) and what is their distance from each other (color).
We will use a package heatmaply, which offers an interactive heatmap. But first we need to calculate a distance matrix of the samples.
sample_dists <- assay(rld) %>% t() %>% dist() %>% as.matrix()
head(sample_dists)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
## SRR1039508 0.00000 45.70156 39.25598 62.63597 44.50960 64.49866 39.58025 63.36519
## SRR1039509 45.70156 0.00000 54.91227 44.53083 59.06801 51.45301 57.46653 45.06067
## SRR1039512 39.25598 54.91227 0.00000 48.72886 43.58240 59.23395 36.74745 57.87995
## SRR1039513 62.63597 44.53083 48.72886 0.00000 63.74705 49.88392 58.49396 36.49798
## SRR1039516 44.50960 59.06801 43.58240 63.74705 0.00000 47.48508 46.41177 65.55040
## SRR1039517 64.49866 51.45301 59.23395 49.88392 47.48508 0.00000 63.60393 52.32102
library(heatmaply)
# We have to be sure that column order of the distance matrix is the same as in sample sheet.
treatment <- colData(dds)[colnames(sample_dists), "dex"]
treatment
## [1] untrt trt untrt trt untrt trt untrt trt
## Levels: untrt trt
cell_line <- colData(dds)[colnames(sample_dists), "cell"]
cell_line
## [1] N61311 N61311 N052611 N052611 N080611 N080611 N061011 N061011
## Levels: N052611 N061011 N080611 N61311
heatmaply(sample_dists, col_side_colors = treatment, row_side_colors = cell_line)
You can see that clustering of the samples has an interpretable biological meaning - they cluster by the treatment.
Instead of the distance matrix we can also calculate a sample correlation matrix:
sample_cor <- assay(rld) %>% cor()
sample_cor
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
## SRR1039508 1.0000000 0.9982056 0.9986694 0.9966427 0.9982853 0.9964280 0.9986452 0.9965726
## SRR1039509 0.9982056 1.0000000 0.9974051 0.9983024 0.9969986 0.9977279 0.9971577 0.9982657
## SRR1039512 0.9986694 0.9974051 1.0000000 0.9979696 0.9983604 0.9969871 0.9988328 0.9971379
## SRR1039513 0.9966427 0.9983024 0.9979696 1.0000000 0.9965235 0.9978695 0.9970713 0.9988612
## SRR1039516 0.9982853 0.9969986 0.9983604 0.9965235 1.0000000 0.9980705 0.9981374 0.9963322
## SRR1039517 0.9964280 0.9977279 0.9969871 0.9978695 0.9980705 1.0000000 0.9965255 0.9976596
## SRR1039520 0.9986452 0.9971577 0.9988328 0.9970713 0.9981374 0.9965255 1.0000000 0.9978600
## SRR1039521 0.9965726 0.9982657 0.9971379 0.9988612 0.9963322 0.9976596 0.9978600 1.0000000
# Order im correlation matrix could be different, so better is to get the coloring values again.
treatment <- colData(dds)[colnames(sample_dists), "dex"]
cell_line <- colData(dds)[colnames(sample_dists), "cell"]
heatmaply(sample_cor, col_side_colors = treatment, row_side_colors = cell_line)
All samples show pretty high correlations and this suggests there are no outlying samples. Again, they cluster by treatment.
Together, these plots suggest to us that the data make biologically sense and we can safely proceed to the differential expression analysis.
Boxplots are a great way to graphically show differences between groups. Rlog-transformed data are used for this purpose. However, always use a proper statistical model (in our case DESeq2) to test for differential expression. Boxplots are great for exploring, but not objective for assessing the differences.
We will use ggboxplot() function from the ggpubr package. This package provides several useful publication-ready ggplot plots.
However, ggplot is generally aimed on plotting data in long format, where each observation (i.e. gene, counts, sample ID, sample group, etc.) has its own row. Our data (count matrix) is in wide format: each value has only gene and sample ID assigned.
We will use the package tidyr to convert count matrix in wide format to long format.
NOTE: Long format consumes much more memory.
count_matrix <- assay(rld)
# First we add gene annotation from rowData.
count_matrix <- cbind(count_matrix, rowData(rld)[, c("SYMBOL", "GENENAME")]) %>% as.data.frame()
# tidyr::pivot_longer() function is converting wide data to the long format.
# key = name of column in long data with column names of wide data
# value = name of column in long data with values of wide data
# The last arguments indicate columns in wide data which don't contain values.
long_data <- tidyr::pivot_longer(
count_matrix,
# We want to convert ("expand") these columns to the long format.
-c(SYMBOL, GENENAME),
# We want to add observation IDs (colnames) to this column.
names_to = "sample_id",
# We want to add observation values (colnames) to this column.
values_to = "rlog_count"
)
# We can remove the temporary count matrix to free memory.
rm(count_matrix)
head(long_data)
As you can see, for each sample and gene we have its counts and annotation on separated row. But we are still missing the data from the sample sheet. To do it, we will join our long data with the sample sheet, using the sample_id column. left_join(x, y) function from dplyr package is doing that:
Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
Visual example of joining two tables. Table 1 and Table 2 could be our long data and sample sheet. id column is similar to our Sample_ID column. In the resulting table on bottom, to Table 1 there are added rows and columns of the rows whose id matches in Table 2.
sample_sheet <- colData(rld) %>% as.data.frame()
sample_sheet$sample_id <- rownames(sample_sheet)
long_data <- dplyr::left_join(long_data, sample_sheet, by = "sample_id")
# We will also remove rows with NA gene symbol.
long_data <- long_data[!is.na(long_data$SYMBOL), ]
head(long_data)
Now our long data are complete and we can proceed to boxplots. Note that we want to work with single genes, so we have to filter our long data. First we will plot boxplots of cell lines and treatments:
plot_boxplot_ggplot2(
long_data[long_data$SYMBOL %in% c("CCND1", "CCND2", "CCND3"), ],
x = "dex",
y = "rlog_count",
feature_col = "SYMBOL",
color_by = "dex",
main = "Boxplot of rlog transformed counts",
x_lab = "Sample group",
y_lab = "rlog(counts)",
do_t_test = FALSE
)
plot_boxplot_ggplot2(
long_data[long_data$SYMBOL == "CCND1", ],
x = "cell",
y = "rlog_count",
feature_col = "SYMBOL",
feature = "CCND1",
color_by = "cell",
facet_by = "dex",
main = "Boxplot of rlog transformed counts",
x_lab = "Cell line",
y_lab = "rlog(counts)",
do_t_test = FALSE
)
However, our data contains very few samples in each group, so in this case boxplots don’t make much sense.
It’s a good practice to include all warnings (warnings()) and errors (traceback()), and also information about your environment - R version, libraries, etc. (sessionInfo()) - on the end of your script. We also save our current variables to a single file from which we can later restore them.
save.image("03_exploratory_analysis.RData")
warnings()
traceback()
## No traceback available
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.7.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
## [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] friendlyeval_0.1.1 heatmaply_1.1.0 viridis_0.5.1 viridisLite_0.3.0 plotly_4.9.2.1
## [6] ggplot2_3.3.0 GenomicFeatures_1.38.2 org.Hs.eg.db_3.10.0 AnnotationDbi_1.48.0 DESeq2_1.26.0
## [11] airway_1.6.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.3 matrixStats_0.56.0 Biobase_2.46.0
## [16] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0
## [21] magrittr_1.5 BiocParallel_1.20.1 emo_0.0.0.9000 glue_1.4.0 knitr_1.28
## [26] rmarkdown_2.1
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.6 Hmisc_4.4-0 BiocFileCache_1.10.2 plyr_1.8.6 lazyeval_0.2.2 splines_3.6.3
## [7] crosstalk_1.1.0.1 usethis_1.6.1 digest_0.6.25 foreach_1.5.0 htmltools_0.4.0 gdata_2.18.0
## [13] fansi_0.4.1 checkmate_2.0.0 memoise_1.1.0 cluster_2.1.0 gclus_1.3.2 limma_3.42.2
## [19] remotes_2.1.1 Biostrings_2.54.0 annotate_1.64.0 askpass_1.1 prettyunits_1.1.1 jpeg_0.1-8.1
## [25] colorspace_1.4-1 blob_1.2.1 rappdirs_0.3.1 ggrepel_0.8.2 xfun_0.13 dplyr_0.8.5
## [31] callr_3.4.3 crayon_1.3.4 RCurl_1.98-1.2 jsonlite_1.6.1 hexbin_1.28.1 genefilter_1.68.0
## [37] iterators_1.0.12 survival_3.1-12 registry_0.5-1 gtable_0.3.0 zlibbioc_1.32.0 XVector_0.26.0
## [43] webshot_0.5.2 pkgbuild_1.0.7 scales_1.1.0 vsn_3.54.0 DBI_1.1.0 Rcpp_1.0.4.6
## [49] xtable_1.8-4 progress_1.2.2 htmlTable_1.13.3 foreign_0.8-76 bit_1.1-15.2 preprocessCore_1.48.0
## [55] Formula_1.2-3 htmlwidgets_1.5.1 httr_1.4.1 gplots_3.0.3 RColorBrewer_1.1-2 acepack_1.4.1
## [61] ellipsis_0.3.0 pkgconfig_2.0.3 XML_3.99-0.3 farver_2.0.3 nnet_7.3-14 dbplyr_1.4.3
## [67] locfit_1.5-9.4 reshape2_1.4.4 tidyselect_1.0.0 labeling_0.3 rlang_0.4.5 munsell_0.5.0
## [73] tools_3.6.3 cli_2.0.2 generics_0.0.2 RSQLite_2.2.0 devtools_2.3.0 evaluate_0.14
## [79] stringr_1.4.0 yaml_2.2.1 processx_3.4.2 bit64_0.9-7 fs_1.4.1 caTools_1.18.0
## [85] purrr_0.3.4 dendextend_1.13.4 biomaRt_2.42.1 compiler_3.6.3 rstudioapi_0.11 curl_4.3
## [91] png_0.1-7 ggsignif_0.6.0 testthat_2.3.2 affyio_1.56.0 tibble_3.0.1 geneplotter_1.64.0
## [97] stringi_1.4.6 ps_1.3.2 desc_1.2.0 lattice_0.20-41 Matrix_1.2-18 vctrs_0.2.4
## [103] pillar_1.4.3 lifecycle_0.2.0 BiocManager_1.30.10 data.table_1.12.8 bitops_1.0-6 seriation_1.2-8
## [109] rtracklayer_1.46.0 R6_2.4.1 latticeExtra_0.6-29 affy_1.64.0 TSP_1.1-10 KernSmooth_2.23-17
## [115] gridExtra_2.3 codetools_0.2-16 sessioninfo_1.1.1 gtools_3.8.2 MASS_7.3-51.6 assertthat_0.2.1
## [121] pkgload_1.0.2 openssl_1.4.1 rprojroot_1.3-2 withr_2.2.0 GenomicAlignments_1.22.1 Rsamtools_2.2.3
## [127] GenomeInfoDbData_1.2.2 hms_0.5.3 grid_3.6.3 rpart_4.1-15 tidyr_1.0.2 ggpubr_0.2.5
## [133] lubridate_1.7.8 base64enc_0.1-3
This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell 🤯
library(rmarkdown)
library(knitr)
library(glue)
# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE)
render("03_exploratory_analysis.Rmd", output_file = "03_exploratory_analysis.html")